FIGURE 2.12
(a) We select τ and λ using 4-bit Q-DETR-R50 on VOC. (b) The mutual information curves
of I(X; E) and I(y_GT; E, q) (Eq. 2.27) on the information plane. The red curves represent
the teacher model (DETR-R101). The orange, green, blue, and purple lines represent the
4-bit baseline, 4-bit baseline + DA, 4-bit baseline + FQM, and 4-bit baseline + DA +
FQM (4-bit Q-DETR), respectively.
memory. We use ImageNet ILSVRC12 [123] to pre-train the backbone of the quantized
student. The training protocol follows that of the employed frameworks [31, 70]. Specifically,
we use a batch size of 16 and optimize Q-DETR with AdamW [164] at an initial learning
rate of 1e-4. We train Q-DETR for 300/500 epochs on the VOC/COCO dataset, multiplying
the learning rate by 0.1 at the 200/400-th epoch, respectively. Following SMCA-DETR, we
train Q-SMCA-DETR for 50 epochs and multiply the learning rate by 0.1 at the 40th epoch
on both the VOC and COCO datasets. We utilize a multi-distillation strategy: in the first
stage, the encoder and decoder networks are kept real-valued; in the second stage, we train
the fully quantized DETR, loading the weights from the first-stage checkpoint. We select
real-valued DETR-R101 (84.5% AP50 on VOC and 43.5% AP on COCO) and SMCA-DETR-R101
(85.3% AP50 on VOC and 44.4% AP on COCO) as teacher networks.
Hyper-parameter selection. As mentioned, we select the hyper-parameters τ and λ in
this part using the 4-bit Q-DETR model. We show the model performance (AP50) under
different setups of the hyper-parameters {τ, λ} in Fig. 2.12 (a), where we conduct ablative
experiments on the baseline + DA (AP50 = 78.8%). As can be seen, the performance first
increases and then decreases as τ grows from left to right. Since τ controls the proportion
of selected distillation-desired queries, full imitation (τ = 0) performs worse than the
vanilla baseline with no distillation (τ = 1), showing that query selection is necessary.
Q-DETR performs best with τ set to 0.5 or 0.6. Varying λ, we find that {λ, τ} = {2.5, 0.6}
boosts the performance of Q-DETR the most, achieving 82.7% AP50 on VOC test2007. Based
on this ablative study, we set the hyper-parameters τ and λ to 0.6 and 2.5, respectively,
for the experiments in this paper.
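One simplified reading of how τ and λ enter the distillation objective is sketched below, assuming each decoder query carries a scalar foreground score (the actual query scoring in Q-DETR's FQM is more involved): queries scoring above the τ-quantile are kept as distillation-desired queries, and their imitation loss is scaled by λ. The function name and the MSE imitation term are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of tau/lambda in a query-selective distillation loss.
# fg_scores is an assumed per-query foreground score; Q-DETR's actual FQM
# scoring and imitation loss differ in detail.
import torch
import torch.nn.functional as F

def distillation_loss(student_q, teacher_q, fg_scores, tau=0.6, lam=2.5):
    """student_q, teacher_q: (num_queries, dim) decoder query embeddings.
    tau = 0 imitates all queries (full imitation);
    tau = 1 selects none (no distillation, the vanilla baseline)."""
    if tau >= 1.0:
        return student_q.new_zeros(())
    threshold = torch.quantile(fg_scores, tau)   # keep the top (1 - tau) fraction
    mask = fg_scores >= threshold                # distillation-desired queries
    return lam * F.mse_loss(student_q[mask], teacher_q[mask].detach())
```

With τ = 0 the threshold is the minimum score, so every query is imitated; raising τ narrows distillation to the most foreground-relevant queries, matching the ablation trend reported above.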
Effectiveness of components. Table 2.2 shows the quantitative improvement of each
component of Q-DETR. As shown, the quantized DETR baseline suffers a severe performance
drop in AP50 (13.6%, 6.5%, and 5.3% at 2/3/4-bit, respectively). DA and FQM each improve
the performance when used alone, and the two techniques boost the performance considerably
further when combined. For example, DA improves the 2-bit baseline